IBM Speech to Text

The IBM® Speech to Text service provides an Application Programming Interface (API) that lets you add speech transcription capabilities to your applications. To transcribe the human voice accurately, the service leverages machine intelligence to combine information about grammar and language structure with knowledge of the composition of the audio signal. The service continuously returns and retroactively updates the transcription as more speech is heard.

The service provides a variety of interfaces to suit the needs of your application. It supports many features that make it suitable for numerous use cases. And it provides a customization interface that lets you enhance its base language and acoustic capabilities with vocabularies and acoustic characteristics specific to your domain, environment, and speakers.

Supported interfaces

The Speech to Text service offers four interfaces:

  • A WebSocket interface for establishing persistent, full-duplex connections with the service for speech transcription.
  • An HTTP REST interface that supports both sessionless and session-based calls to the service for speech recognition.
  • An asynchronous HTTP interface that provides non-blocking calls to the service for speech recognition.
  • A customization interface that lets you expand the vocabulary of a base model with domain-specific terminology or adapt a base model for the acoustic characteristics of your audio.

SDKs are also available that simplify using the service's interfaces in various programming languages. For more information about application development with the service, see Overview for developers.

Speech to Text Parameters

acoustic_customization_id

An optional customization ID for a custom acoustic model that is adapted for the acoustic characteristics of your environment and speakers. By default, no custom model is used. See Custom models.

customization_id

An optional customization ID for a custom language model that includes terminology from your domain. By default, no custom model is used. See Custom models.

customization_weight

An optional double between 0.0 and 1.0 that indicates the relative weight that the service gives to words from a custom language model compared to those from the base vocabulary. The default is 0.3 unless a different weight was specified when the custom language model was trained. See Custom models.

inactivity_timeout

An optional integer that specifies the number of seconds for the service's inactivity timeout; use -1 to indicate infinity. The default is 30 seconds. See Inactivity timeout.

interim_results

An optional boolean that directs the service to return intermediate hypotheses that are likely to change before the final transcript. By default (false), interim results are not returned. See Interim results.

keywords

An optional array of keyword strings that the service spots in the input audio. By default, keyword spotting is not performed. See Keyword spotting.

keywords_threshold

An optional double between 0.0 and 1.0 that indicates the minimum threshold for a positive keyword match. By default, keyword spotting is not performed. See Keyword spotting.

max_alternatives

An optional integer that specifies the maximum number of alternative hypotheses that the service returns. By default, the service returns a single final hypothesis. See Maximum alternatives.

model

An optional model that specifies the language in which the audio is spoken and the rate at which it was sampled, broadband or narrowband. By default, the en-US_BroadbandModel model is used. See Language and models.

profanity_filter

An optional boolean that indicates whether the service censors profanity from a transcript. By default (true), profanity is filtered from the transcript. See Profanity filtering.

smart_formatting

An optional boolean that indicates whether the service converts dates, times, numbers, currency, and similar values into more conventional representations in the final transcript. By default (false), smart formatting is not performed. See Smart formatting.

speaker_labels

An optional boolean that indicates whether the service identifies which individuals spoke which words in a multi-participant exchange. By default (false), speaker labels are not returned. See Speaker labels.

timestamps

An optional boolean that indicates whether the service produces timestamps for the words of the transcript. By default (false), timestamps are not returned. See Word timestamps.

Transfer-Encoding

An optional value of chunked that causes the audio to be streamed to the service. By default, audio is sent all at once as a one-shot delivery. See Audio transmission.

watson-token

An optional authentication token that makes authenticated requests to the service without embedding your service credentials in every call. By default, service credentials must be passed with each request. See Authentication tokens and request logging.

word_alternatives_threshold

An optional double between 0.0 and 1.0 that specifies the threshold at which the service reports acoustically similar alternatives for words of the input audio. By default, word alternatives are not returned. See Word alternatives.

word_confidence

An optional boolean that indicates whether the service provides confidence measures for the words of the transcript. By default (false), word confidence measures are not returned. See Word confidence.

X-Watson-Authorization-Token

An optional authentication token that makes authenticated requests to the service without embedding your service credentials in every call. By default, service credentials must be passed with each request. See Authentication tokens and request logging.

X-Watson-Learning-Opt-Out

An optional boolean that indicates whether you opt out of the request logging that IBM performs to improve the service for future users. By default (false), request logging is performed. See Authentication tokens and request logging.
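
To see how these parameters fit together, the illustrative snippet below passes several of them as query parameters on a sessionless HTTP recognize call with the requests library. The credentials, file name, and parameter values are placeholders, not a tested recipe.


In [ ]:
# Illustrative only: combine several of the parameters above on one request.
# Replace the credentials and audio file with your own before running.
import requests

params = {
    'model': 'en-US_BroadbandModel',   # language and sampling rate
    'max_alternatives': 3,             # return up to three final hypotheses
    'timestamps': 'true',              # per-word timestamps
    'word_confidence': 'true',         # per-word confidence measures
    'smart_formatting': 'true',        # format dates, times, numbers, currency
    'profanity_filter': 'false'        # do not censor profanity
}

with open('audio-file.flac', 'rb') as audio:
    response = requests.post(
        'https://stream.watsonplatform.net/speech-to-text/api/v1/recognize',
        params=params,
        auth=('$USERNAME', '$PASSWORD'),
        headers={'Content-Type': 'audio/flac'},
        data=audio)

print(response.json())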

Import Libraries

Bring in the libraries we need and extend DSX with the Watson Developer Cloud SDK, which is not installed by default.


In [ ]:
# Install or upgrade the Watson Developer Cloud SDK (not in the base DSX image)
!pip install --upgrade watson_developer_cloud

import requests
import json
import os

from os.path import join, dirname
from watson_developer_cloud import SpeechToTextV1

Authentication Handling and File Details

Set the authentication credentials for the Speech to Text API and define the input audio files that we'll be using.

ATTENTION: SET SPEECH SERVICE CREDENTIALS


In [ ]:
# @hidden_cell
url = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
username= "$USERNAME" 
password= "$PASSWORD"

file1 = "https://github.com/krondor/nlp-dsx-pot/raw/master/aging.mp3"
file2 = "http://podcast.c-span.org/podcast/SBHAR1020.mp3"
file3 = "https://github.com/krondor/nlp-dsx-pot/raw/master/reagan-thatcher.mp3"

Basic transcription with CURL

By default, the service returns a basic transcription of the input audio. The following example cURL command submits a brief MP3 file with no additional output parameters, and the service returns basic transcription results.


In [ ]:
!wget {file1} -O aging.mp3 -nc

# Define Local File for CURL (the file downloaded above)
filepath = './aging.mp3'

!curl -X POST -u {username}:{password} \
    --header "Content-Type: audio/mp3" \
    --data-binary @{filepath} \
    "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"

Output Handling with Requests

By default, the service returns a basic transcription of the input audio. In this example, we use the requests library to structure our POST request and pass parameters to the service: we request speaker diarization and select our speech model. The results are then loaded into a pandas data frame and analyzed.


In [ ]:
!wget {file3} -O reagan-thatcher.mp3 -nc

# Define Local File for the requests Call
filepath = './reagan-thatcher.mp3'
filename = os.path.basename(filepath)

# Define Speech to Text Feature Parameters
params = (
    ('model', 'en-US_NarrowbandModel'),
    ('speaker_labels', 'true')
)

# Stream the audio as the request body so it matches the Content-Type header
with open(filename, 'rb') as audio:
    response = requests.post(
        url,
        params=params,
        auth=(username, password),
        headers={"Content-Type": "audio/mp3"},
        data=audio)

response_data = response.json()
print('status_code: {} (reason: {})'.format(response.status_code, response.reason))

Pandas from Results


In [ ]:
import pandas as pd

data = []

for item in response_data['results']:
    for trans in item['alternatives']:
        data.append({'transcript': trans['transcript'], 'confidence': trans['confidence']})

# Create Pandas Data Frame of Transcript Results with Confidence
df = pd.DataFrame(data)

# View Snippet
df.head(5)
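
Speaker Labels from Results

Because the request asked for speaker_labels, the response should also carry a speaker_labels array that maps time ranges to speaker numbers and confidences. The sketch below assumes the documented from, to, speaker, and confidence fields; adjust it if your response differs.


In [ ]:
# Sketch: load the diarization output into its own data frame
# (assumes the response includes the documented 'speaker_labels' array)
labels = pd.DataFrame(response_data.get('speaker_labels', []))

# One row per labeled segment: start/end time, speaker number, confidence
labels.head(5)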

Confidence Spread

Let's plot the confidence per transcription snippet to get a rough idea of the data integrity in our transcription.


In [ ]:
%matplotlib inline

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')

plt.figure()

# Histogram of confidence per transcription snippet
df['confidence'].plot.hist()

Speech to Text with Watson Developer Cloud SDK


In [ ]:
# Reuse the service credentials defined earlier rather than embedding them here
speech_to_text = SpeechToTextV1(
    username=username,
    password=password,
    x_watson_learning_opt_out=False
)

!wget {file1} -O aging.mp3 -nc

filepath = './aging.mp3'  # path to file
filename = os.path.basename(filepath)

print(json.dumps(speech_to_text.models(), indent=2))

print(json.dumps(speech_to_text.get_model('en-US_BroadbandModel'), indent=2))

with open(filename, 'rb') as audio_file:
    print(json.dumps(speech_to_text.recognize(
        audio_file, content_type='audio/mp3', timestamps=True,
        word_confidence=True, speaker_labels=True),
        indent=2))

Extra Credit

The WebSocket API provides a streaming interface for speech to text transcription.

Speech to Text Websockets
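
The sketch below shows one way to call the WebSocket interface from Python with the websocket-client package: exchange the service credentials for a watson-token, open a connection, send a start message describing the audio, stream the MP3 in binary chunks, and read messages until the results arrive. The package choice, chunk size, and message handling here are assumptions to adapt, not a tested recipe.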


In [ ]:
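# Minimal WebSocket sketch (assumes the websocket-client package; install it
# first with `!pip install websocket-client`). Endpoints and message format
# follow the service documentation, but treat this as an outline to adapt.
import json
import requests
import websocket

# Exchange the service credentials for a short-lived watson-token
token = requests.get(
    "https://stream.watsonplatform.net/authorization/api/v1/token",
    params={"url": "https://stream.watsonplatform.net/speech-to-text/api"},
    auth=(username, password)).text

ws_url = ("wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
          "?watson-token=" + token + "&model=en-US_NarrowbandModel")

ws = websocket.create_connection(ws_url)

# Describe the audio and the features we want before sending any data
ws.send(json.dumps({
    "action": "start",
    "content-type": "audio/mp3",
    "interim_results": False,
    "timestamps": True
}))

# Stream the audio in binary chunks, then tell the service we are done
with open('./aging.mp3', 'rb') as audio_file:
    while True:
        chunk = audio_file.read(4096)
        if not chunk:
            break
        ws.send_binary(chunk)
ws.send(json.dumps({"action": "stop"}))

# Read messages until the transcription results (or an error) arrive
while True:
    message = json.loads(ws.recv())
    if "results" in message or "error" in message:
        print(json.dumps(message, indent=2))
        break

ws.close()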